Regular Expressions
Regular expressions are a powerful way to describe and match patterns in text. The regular expression engine attempts to match the regular expression against the input string. Matching starts at the beginning of the string and moves from left to right. The matching is considered to be "greedy" because, at any given point, it always matches the longest possible substring. For example, if a regular expression could match the substring 'aa' or 'aaa', it always takes the longer option.
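The greedy behavior described above can be sketched with the Extract() method. The input string here is made up for illustration.

```idl
; The pattern 'a+' could match 'a', 'aa', or 'aaa' in this input.
; The engine takes the longest candidate at the leftmost position.
str = 'baaad'
match = str.Extract('a+')
PRINT, match   ; aaa
```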
Meta Characters
Most characters in a regular expression are "ordinary", which indicates that they have no special meaning and only match themselves. Certain characters, sometimes called "meta characters", have special meanings. To use a meta character as an ordinary character, you need to "escape" it by preceding it with a backslash character (for example, "\*").
The meta characters are described in the following table:
Character | Description |
. | The period matches any character. |
[ ] | The open bracket character indicates a "bracket expression", discussed below. The close bracket character terminates such an expression. |
\ | The backslash suppresses the special meaning of the character it precedes, and turns it into an ordinary character. To insert a backslash into your regular expression pattern, use a double backslash ('\\'). |
( ) | The open parenthesis indicates a "subexpression", discussed below. The close parenthesis character terminates such a subexpression. |
* | Zero or more of the character or expression to the left. Hence, 'a*' means zero or more instances of 'a'. |
+ | One or more of the character or expression to the left. Hence, 'a+' means one or more instances of 'a'. |
? | Zero or one of the character or expression to the left. Hence, 'a?' will match 'a' or the empty string ''. |
{ } | An interval qualifier allows you to specify exactly how many instances of the character or expression to the left to match. For example, 'a{3}' will match 'aaa'. You can also specify two integers separated by a comma to specify a range of repetitions. For example, 'a{2,4}' will match 'aa', 'aaa', or 'aaaa'. Note that '{0,1}' is equivalent to '?'. |
| | Alternation. This operator indicates that one of several possible choices can match. For example, '(a|b|c)z' will match any of 'az', 'bz', or 'cz'. |
^ $ | Anchors. A '^' matches the beginning of a string, and '$' matches the end. For example, '^abc' will only match strings that start with the string 'abc'. '^abc$' will only match a string containing only 'abc'. |
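As a rough illustration of several meta characters from the table, the sketch below uses the Matches() method on a few made-up strings. The variable names and sample values are assumptions chosen for the example. Because Matches() looks for the pattern anywhere in the string, the anchors are what force a whole-string comparison.

```idl
; Alternation with anchors: the whole string must be 'az', 'bz', or 'cz'.
word = 'bz'
PRINT, word.Matches('^(a|b|c)z$')    ; true
word = 'abz'
PRINT, word.Matches('^(a|b|c)z$')    ; false, extra leading character

; Interval qualifier: between two and four 'a' characters, nothing else.
run = 'aaa'
PRINT, run.Matches('^a{2,4}$')       ; true
run = 'aaaaa'
PRINT, run.Matches('^a{2,4}$')       ; false, five is outside the range

; Escaping: '\*' matches a literal asterisk instead of acting as
; a repetition operator.
expr = '3*4'
PRINT, expr.Matches('3\*4')          ; true
```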
Subexpressions
Subexpressions are those parts of a regular expression enclosed in parentheses. There are two reasons to use subexpressions:
- To apply a repetition operator to more than one character. For example, '(fun){3}' matches 'funfunfun', while 'fun{3}' matches 'funnn'.
- To extract subexpressions using the SUBEXPR keyword to the Extract() method.
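Both uses of subexpressions can be sketched as follows. The input strings are hypothetical. With SUBEXPR set, Extract() is expected to return an array whose first element is the overall match, followed by one element per subexpression.

```idl
; Repetition applied to a group versus a single character.
s = 'funfunfun'
PRINT, s.Matches('^(fun){3}$')   ; true, the group 'fun' repeats three times
s = 'funnn'
PRINT, s.Matches('^fun{3}$')     ; true, only the 'n' repeats three times

; Extracting subexpressions from a name=value pair.
str = 'width=640'
parts = str.Extract('([a-z]+)=([0-9]+)', /SUBEXPR)
PRINT, parts[0]   ; width=640  (overall match)
PRINT, parts[1]   ; width      (first subexpression)
PRINT, parts[2]   ; 640        (second subexpression)
```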
Bracket Expressions
Bracket expressions (expressions enclosed in square brackets) are used to specify a set of characters that can satisfy a match. Many of the meta characters described above (.*[\) lose their special meaning within a bracket expression. The right bracket loses its special meaning if it occurs as the first character in the expression (after an initial '^', if any).
There are several different forms of bracket expressions, including:
- Matching List: A matching list expression specifies a list that matches any one of the characters in the list. For example, '[abc]' matches any of the characters 'a', 'b', or 'c'.
- Non-Matching List: A non-matching list expression begins with a '^', and specifies a list that matches any character not in the list. For example, '[^abc]' matches any character except 'a', 'b', or 'c'. The '^' only has this special meaning when it occurs first in the list, immediately after the opening '['.
- Range Expression: A range expression consists of two characters separated by a hyphen, and matches any character that falls lexically within the range indicated. For example, '[A-Za-z]' will match any alphabetic character, upper or lower case. Another way to get this effect is to specify '[a-z]' and use the FOLD_CASE keyword on the Extract(), Matches(), or Split() methods.
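The three forms above might be exercised like this. The sample strings are invented for the example.

```idl
; Range expression: both cases listed explicitly.
name = 'Hello'
PRINT, name.Matches('^[A-Za-z]+$')            ; true

; The same effect using a lower-case range and FOLD_CASE.
PRINT, name.Matches('^[a-z]+$', /FOLD_CASE)   ; true

; Without FOLD_CASE the upper-case 'H' is outside the range.
PRINT, name.Matches('^[a-z]+$')               ; false

; Non-matching list: the first run of characters that are not digits.
label = 'abc123'
PRINT, label.Extract('[^0-9]+')               ; abc

; Matching list: the first character that is 'a', 'b', or 'c'.
PRINT, label.Extract('[abc]')                 ; a
```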
Special Characters
Special (non-printing) characters are often represented in regular expressions using backslash escape codes, such as \t to represent a TAB character or \n to represent a newline character. IDL does not support these backslash codes in regular expressions. Instead, you can use the ASCII value to represent these characters:
ASCII Character | Byte Value |
Bell | 7b |
Backspace | 8b |
Horizontal Tab | 9b |
Linefeed | 10b |
Vertical Tab | 11b |
Formfeed | 12b |
Carriage Return | 13b |
Escape | 27b |
For example, to represent the TAB character, use the expression STRING(9B).
This syntax can be used when comparing strings or performing regular expression matching. For example, to find the position of the first TAB character in a string:
result = string.IndexOf(STRING(9b))
where string is a variable containing the string to be searched.
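A slightly fuller sketch of the same idea follows. The variable names and sample data are assumptions made for illustration. It shows the byte value used on its own and concatenated into a larger pattern.

```idl
; Build a TAB-delimited line from the byte value 9.
tab = STRING(9B)
line = 'id' + tab + 'name' + tab + 'value'

; Zero-based position of the first TAB character.
pos = line.IndexOf(tab)
PRINT, pos                 ; 2

; Split the line into fields at each TAB.
fields = line.Split(tab)
PRINT, fields[1]           ; name

; The TAB can also be placed inside a bracket expression:
; split on any run of spaces and TABs.
pattern = '[ ' + tab + ']+'
messy = 'a ' + tab + ' b'
parts = messy.Split(pattern)
PRINT, parts[0]            ; a
PRINT, parts[1]            ; b
```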